Add Megatron-LM cross-entropy integration#1207
Conversation
Mecoli1219 left a comment:
Overall looks great! Excited to support Megatron with Liger. Left some comments to address.
```python
if tp_size > 1:
    raise RuntimeError(
        f"apply_liger_kernel_to_megatron currently requires tensor_model_parallel_size=1, "
        f"got {tp_size}. Vocab-parallel cross-entropy support is planned as follow-up work."
    )
```
This is a constraint that needs to be addressed in the future, given that TP is a common use case in Megatron, but it's a great start for supporting Megatron!
BTW, does this patching also not support other parallel strategies (Sequence Parallel, etc.)?
It feels a bit awkward to me to have the patching and function-wrapping logic on the Liger side. It is certainly a simpler way to use Liger's CE without touching the Megatron codebase. However, if supporting the Megatron framework is not on our roadmap and we are not going to add it to our test suite in the short term, it will be quite inconvenient to maintain this support in the long run. WDYT?
BTW, Megatron's SP requires TP > 1.
```python
global _ACTIVATION_LOGGED
if not _ACTIVATION_LOGGED:
```
```python
    return liger_fused_vocab_parallel_cross_entropy


def apply_liger_kernel_to_megatron(
```
Can we move it to another file like monkey_patch.py under the same directory? If we want to add more kernels besides CE, it would be cleaner to separate the framework-level and kernel-specific logic. You can mirror src/liger_kernel/transformers/:

```
src/liger_kernel/megatron/
    monkey_patch.py      # apply_liger_kernel_to_megatron + TP check
    cross_entropy.py     # _build_wrapper + _patch_fused_vocab_parallel_ce
    other_future_kernel.py
```
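A rough sketch of what that split could look like; the module layout and helper names follow the suggestion above and are illustrative, not code that exists in this PR:

```python
# src/liger_kernel/megatron/monkey_patch.py -- illustrative sketch only.
# Framework-level concerns (parallel-config validation) live here; the
# kernel-specific wrapper/patch helpers live in cross_entropy.py.
from megatron.core import parallel_state

# Hypothetical per-kernel module following the layout suggested above.
from liger_kernel.megatron.cross_entropy import _patch_fused_vocab_parallel_ce


def apply_liger_kernel_to_megatron(cross_entropy: bool = True) -> None:
    """Entry point: validate the parallel configuration, then delegate the
    actual patching to kernel-specific modules."""
    tp_size = parallel_state.get_tensor_model_parallel_world_size()
    if tp_size > 1:
        raise RuntimeError(
            "apply_liger_kernel_to_megatron currently requires tensor_model_parallel_size=1, "
            f"got {tp_size}."
        )
    if cross_entropy:
        _patch_fused_vocab_parallel_ce()
```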
Summary
Adds an `apply_liger_kernel_to_megatron()` monkey-patch that swaps Megatron-LM's native `fused_vocab_parallel_cross_entropy` for Liger's Triton cross-entropy kernel. This enables online softmax, in-place gradients, and no full-softmax materialization inside Megatron training pipelines.
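A minimal usage sketch, assuming `apply_liger_kernel_to_megatron()` takes no required arguments and is importable from the `liger_kernel` package (the exact module path and signature are defined by this PR's diff, not shown here):

```python
# Hypothetical import path -- adjust to wherever this PR places the function.
from liger_kernel import apply_liger_kernel_to_megatron

# Patch before Megatron builds the model / loss function, so every call to
# fused_vocab_parallel_cross_entropy resolves to Liger's Triton CE kernel.
apply_liger_kernel_to_megatron()

# ...then launch Megatron training as usual, e.g. megatron.training.pretrain(...).
```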
Scope:
`tensor_model_parallel_size=1` only. With TP > 1, each rank holds a sharded `[N, V/tp]` logits slice, and CE requires cross-rank all-reduces that Liger's kernel does not perform. The patch raises `RuntimeError` at patch time (via `megatron.core.parallel_state`) and again at call time (via the `tp_group` argument Megatron passes), so misconfiguration fails loudly. Vocab-parallel support is follow-up work.
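A sketch of how such a call-time guard can look; the function and argument names are illustrative (only the `tp_group` argument is taken from the description above), and it assumes `LigerCrossEntropyLoss` from `liger_kernel.transformers` with `reduction="none"`:

```python
import torch.distributed as dist

from liger_kernel.transformers import LigerCrossEntropyLoss  # Liger's Triton CE


def liger_fused_vocab_parallel_cross_entropy(vocab_parallel_logits, target, tp_group=None):
    """Sketch of a drop-in stand-in for Megatron's fused_vocab_parallel_cross_entropy.
    Refuses to run if the tensor-parallel group spans more than one rank, because
    Liger's kernel expects the full, unsharded vocab dimension."""
    if tp_group is not None and dist.is_initialized() and dist.get_world_size(group=tp_group) > 1:
        raise RuntimeError(
            "Liger CE patch only supports tensor_model_parallel_size=1, "
            f"but got a TP group of size {dist.get_world_size(group=tp_group)}."
        )
    # Megatron expects per-token (unreduced) losses with the same shape as target.
    loss_fn = LigerCrossEntropyLoss(reduction="none")
    vocab_size = vocab_parallel_logits.size(-1)
    loss = loss_fn(vocab_parallel_logits.reshape(-1, vocab_size), target.reshape(-1))
    return loss.view_as(target)
```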
Tested on Qwen3-30B-A3B scaled MoE, 1× H100_8, BF16:
Model config:
Parallelism:
Training config:
Throughput results:

| | Throughput | Iter time |
| --- | --- | --- |
| Megatron native fused CE (baseline) | ~99 TFLOP/s/GPU | ~39,400 ms |
| Liger CE (this PR) | ~108 TFLOP/s/GPU (+9%) | ~35,900 ms |
Numerical correctness: lm_loss ~4.1e-3 in both, no NaN/skipped iterations.
Variance: Liger CE 107.7-109.1 TFLOP/s/GPU (consistent).
Test setup: Single H100 80GB, sequence length S=2048, batch size B=4, vocab sizes 4K → 131K. Each provider runs the same cross-entropy operation, just with a different implementation.
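A rough standalone sketch of such a vocab-size sweep (not the harness that produced the numbers above); it assumes `LigerCrossEntropyLoss` from `liger_kernel.transformers` as the Liger provider and `torch.nn.functional.cross_entropy` as the baseline:

```python
import torch
import torch.nn.functional as F

from liger_kernel.transformers import LigerCrossEntropyLoss


def bench(loss_fn, logits, target, iters=20):
    """Average forward+backward time in ms, measured with CUDA events."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        logits.grad = None
        loss_fn(logits, target).backward()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters


B, S = 4, 2048
for vocab in (4_096, 32_768, 131_072):
    logits = torch.randn(B * S, vocab, device="cuda", dtype=torch.bfloat16, requires_grad=True)
    target = torch.randint(0, vocab, (B * S,), device="cuda")
    # Baseline runs on its own copy so Liger's in-place gradient writes don't interfere.
    torch_logits = logits.detach().clone().requires_grad_(True)
    torch_ms = bench(F.cross_entropy, torch_logits, target)
    liger_ms = bench(LigerCrossEntropyLoss(), logits, target)
    print(f"V={vocab:>7}: torch {torch_ms:.2f} ms, liger {liger_ms:.2f} ms")
```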
Testing Done
- `make test` to ensure correctness
- `make checkstyle` to ensure code style
- `make test-convergence` to ensure convergence